Average word length |
---|
5.0175 |
word length | percentage |
---|---|
1 | 2.7192 |
2 | 17.1532 |
3 | 18.9676 |
4 | 14.9626 |
5 | 9.1600 |
6 | 7.7932 |
7 | 8.4004 |
8 | 7.9670 |
9 | 5.1026 |
10 | 3.3311 |
11 | 2.0977 |
12 | 1.0905 |
13 | 0.7191 |
14 | 0.2953 |
15 | 0.1071 |
16 | 0.0452 |
17 | 0.0333 |
18 | 0.0167 |
19 | 0.0095 |
20 | 0.0167 |
21 | 0.0024 |
22 | 0.0024 |
24 | 0.0048 |
30 | 0.0024 |
In this subsection, words are weighted by their frequency. For a fixed word length, we take every distinct word of that length and count it as many times as it occurs in the corpus (its multiplicity).
Because stopwords are both very frequent and short, the average word length is lower than in the previous subsection, where each distinct word is counted only once.
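Counting with multiplicity means the average is a frequency-weighted mean, so it can also be recovered from the percentage table above. The following is a minimal sketch of that consistency check; the variable names are illustrative and the values are copied from the table.

```python
# Percentages copied from the word-length table above
# (word length -> percentage of running words with that length).
distribution = {
    1: 2.7192, 2: 17.1532, 3: 18.9676, 4: 14.9626, 5: 9.1600,
    6: 7.7932, 7: 8.4004, 8: 7.9670, 9: 5.1026, 10: 3.3311,
    11: 2.0977, 12: 1.0905, 13: 0.7191, 14: 0.2953, 15: 0.1071,
    16: 0.0452, 17: 0.0333, 18: 0.0167, 19: 0.0095, 20: 0.0167,
    21: 0.0024, 22: 0.0024, 24: 0.0048, 30: 0.0024,
}

# The percentages should add up to roughly 100 (up to rounding).
total_pct = sum(distribution.values())

# Weighted mean: each length contributes in proportion to its share.
mean_length = sum(length * pct for length, pct in distribution.items()) / total_pct

print(f"sum of percentages: {total_pct:.4f}")
print(f"average word length: {mean_length:.4f}")
```

The result should land close to the reported average of 5.0175; small deviations come from rounding in the table.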
The plot of word length against the number of words of that length (counted with multiplicity) has its maximum around 5, depending on the language. Moreover, with a logarithmic scale on the y-axis, the curve is nearly linear between lengths 10 and 35.
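The near-linear behaviour on a logarithmic y-axis can be illustrated with the table above. The sketch below fits a straight line to log10(percentage) by ordinary least squares. Restricting the fit to lengths 10 to 19 is an assumption made here because the table becomes sparse and noisy beyond that point; it is not the range stated in the text.

```python
import math

# Percentages for lengths 10..19, copied from the table above.
# The choice of this sub-range is an assumption for illustration only.
lengths = list(range(10, 20))
percentages = [3.3311, 2.0977, 1.0905, 0.7191, 0.2953,
               0.1071, 0.0452, 0.0333, 0.0167, 0.0095]

# Work on the logarithmic scale, as in the plot described in the text.
log_pct = [math.log10(p) for p in percentages]

n = len(lengths)
mean_x = sum(lengths) / n
mean_y = sum(log_pct) / n

# Ordinary least squares for y = intercept + slope * x.
s_xx = sum((x - mean_x) ** 2 for x in lengths)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, log_pct))
s_yy = sum((y - mean_y) ** 2 for y in log_pct)

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
r_squared = (s_xy ** 2) / (s_xx * s_yy)  # closer to 1 means closer to linear

print(f"slope per extra character (log10 scale): {slope:.4f}")
print(f"intercept: {intercept:.4f}, R^2: {r_squared:.4f}")
```

A negative slope with R^2 close to 1 would be consistent with a roughly exponential decay of the share of words as length increases.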
Average word length is one of the classic parameters for a language.
Counting with multiplicity makes average word length independent of the corpus size.
Average word length:
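-- Total number of running words: every word counted with its frequency
SELECT @all := SUM(freq) FROM words WHERE w_id > 100;
-- Frequency-weighted average word length
SELECT SUM(CHAR_LENGTH(word) * freq) / @all FROM words WHERE w_id > 100;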
Data for large table:
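-- Total number of running words: every word counted with its frequency
SELECT @all := SUM(freq) FROM words WHERE w_id > 100;
-- Percentage of running words per word length (lengths up to 50)
SELECT CHAR_LENGTH(word), 100 * SUM(freq) / @all FROM words WHERE w_id > 100 AND CHAR_LENGTH(word) <= 50 GROUP BY CHAR_LENGTH(word) ORDER BY CHAR_LENGTH(word);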
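To try both queries without access to the corpus database, the following sketch runs equivalent statements against an in-memory SQLite table from Python. The four rows of toy data are hypothetical, the table layout (`w_id`, `word`, `freq`) is inferred from the queries above, and SQLite's `length()` stands in for MySQL's `CHAR_LENGTH()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (w_id INTEGER PRIMARY KEY, word TEXT, freq INTEGER)")

# Hypothetical toy data, not taken from any real corpus.
toy_rows = [
    (101, "the", 50),
    (102, "of", 30),
    (103, "corpus", 15),
    (104, "statistics", 5),
]
conn.executemany("INSERT INTO words VALUES (?, ?, ?)", toy_rows)

# Total number of running words (sum of frequencies), as in @all.
total = conn.execute(
    "SELECT SUM(freq) FROM words WHERE w_id > 100"
).fetchone()[0]

# Frequency-weighted average word length.
average = conn.execute(
    "SELECT 1.0 * SUM(length(word) * freq) / ? FROM words WHERE w_id > 100",
    (total,),
).fetchone()[0]

# Percentage of running words per word length.
distribution = conn.execute(
    "SELECT length(word), 100.0 * SUM(freq) / ? FROM words "
    "WHERE w_id > 100 AND length(word) <= 50 "
    "GROUP BY length(word) ORDER BY length(word)",
    (total,),
).fetchall()

print("average word length:", average)
for word_length, percentage in distribution:
    print(word_length, percentage)

conn.close()
```

In the toy data the two short, frequent words dominate the total, which mirrors the effect of stopwords described above.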
As in 3.5.2.1, the nearly linear part in the graph needs an explanation.
3.5.1.1 Words by Length without multiplicity